Handling many Hierarchical Matrix calculations using one CUDA kernel launch

Hierarchical matrices are a powerful tool for representing large matrices, that it uses a combination of dense and low-rank blocks to store matrices in accordance to their ranks, while still maintaining a high numerical accuracy. There exist many researches on how to perform standard matrix operations directly on the hierarchical structure, both on CPU and heterogeneous systems.

The current day heterogeneous matrix calculation libraries, such as MAGMA and CUBLAS, already have very efficient methods doing dense matrix calculations. Building a new heterogeneous h-matrix based matrix library from these dense libraries has potential performance limitations:

Hierarchical matrix operations all have different problem sizes, which not all blocks in the hierarchical matrix divides into equally sized blocks. Solving a hierarchical problem includes building up the small solutions from dense matrix and low rank matrix solutions. If the problem size is small, launching many routines from the dense libraries could build up into a performance-costly move eventually.

On the other hand, launching many kernels may potentially cause the communication cost between host and device to grow significantly, which there will be many synchronization calls made through the PCI-Express bus. In systems that have multiple GPUs installed, the data needs to move frequently between host and GPU devices, and among different devices.

In our research, we are proposing to build a brand new kernel that specializes in handling a hierarchical matrix calculation. Using the one single kernel launch to solve an adequately sized hierarchical matrix can potentially help alleviating the communication costs between the host and devices, and resizing the hierarchically divided problems into more consistent problem sizes, due to the fact that a hierarchically partitioned matrix consists of many smaller hierarchical matrices.

Our approach consists of a routine that execute heterogeneously: The CPU only reads the hierarchical structure of the matrix, and build up the necessary information, such as the addresses of the blocks, the block dimensions, and what kind of calculation etc. and pass them to the GPUs. After all information is gathered, CPU initiates the kernel and each GPU executes its corresponding task and starts performing changes to the data.

The hierarchical information gathering done on CPUs consists of majorly 4 parts:

1. Generation of a tree of hierarchical operations from the hierarchical structure. The generated tree consists many hierarchical operations as intermediate nodes too, but eventually they all divide into dense or low-rank operations. The tree flattens and reduces into a list of dense and low-rank operations.

2. From the list of operations, CPU generates a DAG to find data dependencies that exist between entries of the list.

3. According to the DAG and the kernel launching configurations, CPU schedules the list of operations into individual instruction memory that dedicates to each worker (in the CUDA context: thread blocks), in a way that both parallelize the list of instructions, while maintaining the data dependencies.

4. As CPU already collected all information that the GPU needs, it copies the individual instruction memories, the actual matrix data and their corresponding location pointers into CUDA global memory, while at the same time allocates a room for thread blocks to communicate. The kernel is now ready to launch.